Scraping StackOverflow

StackOverflow
StackOverflow is a Web site for asking and answering questions about programming and software. It covers many languages and tools. People post questions; other people answer and provide comments. Gradually, a consensus emerges and the “best” answer is determined. People receive votes for their posts and answers, and so establish a reputation.
We want to analyze the questions, answers and comments: who provides each, what their reputations are, and what the content of the questions and answers is. To do this, we need access to the data for the posts, answers and comments.
We can access StackOverflow data in three different ways:
• using the API at https://api.stackexchange.com/
• downloading a data dump as XML files
• scraping the pages
In this assignment, we will scrape the data.
When considering scraping data, it is always important to check
1. whether the data are available in bulk as a download,
2. whether there is an API that provides the data in a more structured format, and
3. whether you are legally entitled to scrape the Web site based on its Terms of Service (ToS).
You also need to be considerate and respectful of the site’s resources and avoid making unnecessary requests. This means carefully testing and debugging your code before attempting to make a large number of requests. It is also a very good idea to retrieve each document and store it locally so that you can process it multiple times without having to make another request, e.g., if you find a bug in your code or want to extract additional information from the document. A sketch of this idea follows.
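To illustrate storing documents locally, here is a minimal caching sketch in R using the RCurl package. The getCachedPage() helper and its file-naming scheme are hypothetical, not part of the assignment; adapt them as needed.

    library(RCurl)

    # Fetch each URL at most once; keep the HTML on disk so re-running the
    # code never repeats a request. (Hypothetical helper: the cache
    # directory name and file-naming scheme are our own choices.)
    getCachedPage = function(url, dir = "cache")
    {
        if (!file.exists(dir))
            dir.create(dir)
        # Derive a file name from the URL (a simple, hypothetical scheme).
        file = file.path(dir, paste0(gsub("[^A-Za-z0-9]", "_", url), ".html"))
        if (!file.exists(file))
            writeLines(getURLContent(url), file)
        file  # htmlParse() accepts a local file name directly
    }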
In this assignment, you are to write code to scrape information about posts for a given category, i.e., a broad, high-level tag. We’ll look at the tag r at http://stackoverflow.com/questions/tagged/r. Visit this page to see its structure. On this front page, we see a list of the most recent questions that have been posted. We can find older questions on subsequent pages via the “next” link at the bottom of the page.
Specific Instructions
Part 1 – Scraping the Summaries of the Posts
1. Process the current summary page of posts, starting with the first page of results
2. For each post, extract information about
o who posted it,
o when it was posted,
o the title of the post,
o the reputation level of the poster,
o the current number of views for the post,
o the current number of answers for the post,
o the vote “score” for the post,
o the URL for the page with the post, answers and comments,
o the id (a number) uniquely identifying the post.
3. Obtain the URL for the “next” page listing posts
4. Repeat steps 1, 2 and 3 (a minimal sketch of steps 1 and 3 appears below).
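The following is a minimal sketch of steps 1 and 3 for a single summary page, using the XML and RCurl packages. The XPath expressions encode assumptions about StackOverflow’s markup at the time of writing (summaries in div elements with class question-summary and ids of the form question-summary-<id>, titles in question-hyperlink links, a rel="next" pager link), and only a subset of the required fields is extracted; verify the selectors against the live page and fill in the rest.

    library(XML)
    library(RCurl)

    # Sketch of steps 1 and 3 for one summary page. The class/id names in
    # the XPath expressions are assumptions about the page layout.
    processSummaryPage = function(url)
    {
        doc = htmlParse(getURLContent(url), asText = TRUE)
        posts = getNodeSet(doc, "//div[contains(@class, 'question-summary')]")

        info = lapply(posts, function(node) {
            a = getNodeSet(node, ".//a[contains(@class, 'question-hyperlink')]")[[1]]
            data.frame(id    = gsub("question-summary-", "", xmlGetAttr(node, "id")),
                       title = xmlValue(a),
                       url   = getRelativeURL(xmlGetAttr(a, "href"), url),
                       votes = xmlValue(getNodeSet(node,
                                  ".//*[contains(@class, 'vote-count-post')]")[[1]]),
                       stringsAsFactors = FALSE)
        })

        # Step 3: the URL for the "next" page, or NA on the last page.
        nxt = getNodeSet(doc, "//a[@rel = 'next']/@href")
        list(posts = do.call(rbind, info),
             nextPage = if (length(nxt) > 0)
                            getRelativeURL(as.character(nxt[[1]]), url)
                        else NA)
    }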
Of course, you need to write functions to do the different steps. Your top-level function should allow the caller to specify which forum/top-level tag (e.g., r, javascript, d3.js) to scrape. It should also allow the caller to specify a limit on the amount to process, either the number of pages or the total number of posts. If no limit is specified, it should process all of the pages for this topic/tag.
The function should return a data frame with a row for each post. In the sample output, the tags for each post are separated by ; and combined into a single character string.
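A possible skeleton for the top-level function, assuming the processSummaryPage() helper sketched above; maxPages is one way to express the caller’s limit, and multiple tags per post can be collapsed with, e.g., paste(tags, collapse = ";").

    # Follow "next" links until they run out or the caller's page limit is
    # reached. Assumes the (hypothetical) processSummaryPage() from above.
    scrapeTag = function(tag = "r", maxPages = Inf)
    {
        url = sprintf("http://stackoverflow.com/questions/tagged/%s", tag)
        pages = list()
        i = 1
        while (!is.na(url) && i <= maxPages) {
            cur = processSummaryPage(url)
            pages[[i]] = cur$posts
            url = cur$nextPage
            i = i + 1
        }
        do.call(rbind, pages)
    }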
Part 2 – Scraping the Posts, Answers and Comments
Next, write a function that processes the actual page for an individual post, i.e., the page containing the post, its answers and comments. The function should extract and combine information for the post, each answer and each comment. For each of these “entries”, we want
1. the type of entry (post, answer, or comment),
2. the user,
3. userid,
4. date,
5. user’s reputation,
6. the score/votes for this entry,
7. the HTML content as a string for this entry,
8. the identifier of the “parent” entry, i.e., which entry this is a response to – a comment to an answer, or an answer to a post,
9. the unique identifier for the overall post.
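A partial sketch of such a function, showing how the question and its answers might be located and their HTML captured as strings with saveXML(). The 'question' id and 'answer' class are assumptions about StackOverflow’s markup; the user, reputation, date, comment and parent-id fields are omitted for brevity.

    library(XML)
    library(RCurl)

    # Partial sketch: pull the question and answer entries from one post
    # page. The id/class names are assumptions about the page's markup.
    processPostPage = function(url)
    {
        doc = htmlParse(getURLContent(url), asText = TRUE)

        entry = function(node, type)
            data.frame(type  = type,
                       score = xmlValue(getNodeSet(node,
                                  ".//*[contains(@class, 'vote-count-post')]")[[1]]),
                       html  = saveXML(node),  # HTML content as a string
                       stringsAsFactors = FALSE)

        q = getNodeSet(doc, "//div[@id = 'question']")
        a = getNodeSet(doc,
               "//div[contains(concat(' ', normalize-space(@class), ' '), ' answer ')]")
        do.call(rbind, c(lapply(q, entry, "question"),
                         lapply(a, entry, "answer")))
    }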
Again, create a data frame to store all of the posts, answers and comments. A sample result data frame is here. (Use load() to read it into R.)
NOTE: Rather than running your function on all of the URLs from Step 1, use this data for the analysis below. Do, however, test your function on several URLs.
Part 3 – Analyzing R Questions on StackOverflow
Use this data to answer the following short questions; use load() to read it into R. (Sketches for a few of these appear after the list.)
1. What is the distribution of the number of questions each person answered?
2. What are the most common tags?
3. How many questions are about ggplot?
4. How many questions involve XML, HTML or Web Scraping?
5. What are the names of the R functions referenced in the titles of the posts?
6. What are the names of the R functions referenced in the accepted answers and comments of the posts? Here we can do better than we did with the titles, since the entries contain HTML markup and we can look for content inside code blocks.
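Sketches for a few of these questions, assuming the sample data loads into a data frame named rQAs with columns user, type and text (the HTML content). The file name and the column names are assumptions; adapt them to whatever load() actually creates.

    library(XML)

    load("rQAs.rda")  # hypothetical file name for the sample data

    # 1. distribution of the number of questions each person answered
    table(table(rQAs$user[rQAs$type == "answer"]))

    # 3. questions whose text mentions ggplot
    sum(grepl("ggplot", rQAs$text[rQAs$type == "question"], ignore.case = TRUE))

    # 6. candidate function names in the <code> blocks of one entry's HTML:
    # parse the fragment, pull the code nodes, match identifier( patterns.
    getFunctionNames = function(html)
    {
        doc = htmlParse(html, asText = TRUE)
        code = unlist(xpathSApply(doc, "//code", xmlValue))
        m = gregexpr("[a-zA-Z._][a-zA-Z0-9._]*\\(", code)
        unique(gsub("\\($", "", unlist(regmatches(code, m))))
    }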
Potentially Useful Functions
htmlParse()
getURLContent(), getForm()
getNodeSet(), xpathApply(), xpathSApply().
xmlName(), xmlGetAttr(), xmlAttrs(), xmlValue(), xmlSize(), xmlChildren(), xmlApply(), xmlSApply()
saveXML() (to convert an HTML tree to a string and/or write to a file)
getRelativeURL() (for getting the full path of a URL given a relative link and the base URL)
docName() (to get the filename/URL of the XML/HTML document)
try(), tryCatch()
file.exists()
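As an example of try()/tryCatch(), a common pattern (our own, not prescribed by the assignment) wraps each request so that one bad URL does not abort a long scraping run:

    library(RCurl)
    library(XML)

    url = "http://stackoverflow.com/questions/tagged/r"

    # Guard each request so a single failure doesn't stop the whole scrape.
    page = tryCatch(getURLContent(url),
                    error = function(e) {
                        message("request failed for ", url, ": ",
                                conditionMessage(e))
                        NULL
                    })
    if (!is.null(page))
        doc = htmlParse(page, asText = TRUE)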
Books, Tutorials, etc.
There are many tutorials and examples online, as well as good short books and case studies.
• XPath and XPointer, John Simpson. O’Reilly.
• XML and Web Technologies with R, Nolan & Temple Lang.
• Chapter 4, Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining, Simon Munzert, Christian Rubba, Peter Meißner & Dominic Nyhuis.
• Chapter 12, “Exploring Data Science Jobs with Web Scraping and Text Mining”, in Data Science in R: A Case Studies Approach, Nolan & Temple Lang.
• W3Schools XPath tutorial
